-
Notifications
You must be signed in to change notification settings - Fork 178
Implement reverse performance optimization
#4775
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
4483045 to
0246535
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: I recall the major comment on original PR is early optimization in analyzer layer. Is this new PR trying to address the concern? Ref: #4056 (comment)
Hi Chen, I think that's a valid concern. However, after trying it out, I think it has significant complexity comparing to the current approach. I think |
noCharger
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add benchmark results on before VS after?
| // Fallback: use ROW_NUMBER approach when no existing sort | ||
| RexNode rowNumber = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an idea: How about ignore the reverse command if there is not existing collations or no IMPLICIT_FIELD_TIMESTAMP in rowType.
If there is IMPLICIT_FIELD_TIMESTAMP in rowType, no need to add ROW_NUMBER
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I think it makes sense since reverse command may not have meaningful semantics if there's no ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the implementation: Now reverse command is ignored if no collations/@timestamp found
Updated the test cases and documentation as well
This commit optimizes the `reverse` command in the Calcite planner by intelligently reversing existing sort collations instead of always using the ROW_NUMBER() approach. Key changes: - Added PlanUtils.reverseCollation() method to flip sort directions and null directions - Updated CalciteRelNodeVisitor.visitReverse() to: - Check for existing sort collations - Reverse them if present (more efficient) - Fall back to ROW_NUMBER() when no sort exists - Added comprehensive integration test expected outputs for: - Single field reverse pushdown - Multiple field reverse pushdown - Reverse fallback cases - Double reverse no-op optimizations This optimization significantly improves performance when reversing already-sorted data by leveraging database-native sort reversal. Based on PR opensearch-project#4056 by @selsong Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kai Huang <[email protected]>
b16218b to
39aecf6
Compare
Signed-off-by: Kai Huang <[email protected]>
Test is done against big5_v2_first100m dataset(The first 100m docs of big5) Before: Query failed to execute even with head 10 After: |
| } | ||
|
|
||
| @Test | ||
| public void testReverseWithTimestampField() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some ITs for streamstats? the streamstats always sort on __stream_seq__ to keep the output ordering as same as input's. We'd better add some ITs to verify the pipe streamstats | reverse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added ITs for streamstats, since reverse command only works when there's a detectable collation now, using streamstats only will make reverse command ignored. Do we want to keep it this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the streamstats contains collation __stream_seq__ ASC
Signed-off-by: Kai Huang <[email protected]>
Description
Originally from #4056 by @selsong
This PR implements a significant performance optimization for the
reversecommand by eliminating the expensive ROW_NUMBER() window function and implementing a three-tier logic based on query context.Motivation
The previous implementation used ROW_NUMBER() window function which:
Solution: Three-Tier Reverse Logic
The
reversecommand now follows context-aware behavior:Implementation Details
1. Reverse with Explicit Sort (Primary Use Case)
Query:
Behavior: Flips all sort directions:
+balance, -firstname→-balance, +firstnameLogical Plan:
Physical Plan: (efficiently pushes reversed sort to OpenSearch)
2. Reverse with @timestamp (Time-Series Optimization)
Query:
Behavior: When no explicit sort exists but the index has an @timestamp field, reverse automatically sorts by @timestamp DESC to show most recent events first.
Use Case: Common pattern in log analysis - users want recent logs first
Logical Plan:
3. Reverse Ignored (No-Op Case)
Query:
Behavior: When there's no explicit sort AND no @timestamp field, reverse is ignored. Results appear in natural index order.
Rationale: Avoid expensive operations when reverse has no meaningful semantic interpretation.
Logical Plan:
Note: No sort node is added - reverse is completely ignored.
4. Double Reverse (Cancellation)
Query:
Behavior: Two reverses cancel each other out, returning to original sort order.
Logical Plan:
Final sort order matches original query:
+balance, -firstname5. Multiple Sorts + Reverse
Query:
Behavior: Reverse applies to the most recent sort (from PPL semantics, last sort wins).
Logical Plan:
Result: Only
firstnamesort is reversed (DESC → ASC). Thebalancesort is overridden by PPL's "last sort wins" rule.Related Issues
Resolves #3924
Check List
--signoff.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.